A critical evaluation of the content validity of patient-reported outcome measures assessing health-related quality of life in children with cancer: a systematic review

Background With increasing survival rates in pediatric oncology, the need to monitor health-related quality of life (HRQOL) is becoming even more important. However, available patient-reported outcome measures (PROMs) have been criticized. This review aims to systematically evaluate the content validity of PROMs for HRQOL in children with cancer. Methods In December 2021, a systematic literature search was conducted in PubMed. PROMs were included if they were used to assess HRQOL in children with cancer and had a lower age-limit between 8 and 12 years and an upper age-limit below 21 years. The COSMIN methodology for assessing the content validity of PROMs was applied to grade evidence for relevance, comprehensiveness, and comprehensibility based on quality ratings of development studies (i.e., studies related to concept elicitation and cognitive interviews for newly developed questionnaires) and content validity studies (i.e., qualitative studies in new samples to evaluate the content validity of existing questionnaires). Results Twelve PROMs were included. Due to insufficient patient involvement and/or poor reporting, the quality of most development studies was rated ‘doubtful’ or ‘inadequate’. Few content validity studies were available, and these were mostly ‘inadequate’. Following the COSMIN methodology, evidence for content validity was ‘low’ or ‘very low’ for almost all PROMs. Only the PROMIS Pediatric Profile had ‘moderate’ evidence. In general, the results indicated that the PROMs covered relevant issues, while results for comprehensiveness and comprehensibility were partly inconsistent or insufficient. Discussion Following the COSMIN methodology, there is scarce evidence for the content validity of available PROMs for HRQOL in children with cancer. Most instruments were developed before the publication of milestone guidelines and therefore were not able to fulfill all requirements. 
Efforts are needed to catch up with methodological progress made during the last decade. Further research should adhere to recent guidelines to develop new instruments and to strengthen the evidence for existing PROMs. Supplementary Information The online version contains supplementary material available at 10.1186/s41687-023-00540-8.


Background
In recent decades, survival rates in pediatric oncology have increased considerably [1][2][3]. Even though overall survival remains the primary outcome [4], patients' health-related quality of life (HRQOL) also needs careful monitoring and management. HRQOL as defined by the World Health Organization (WHO) is an "individual's perception of their position in life […] incorporating in a complex way individuals' physical health, psychological state, level of independence, social relationships, personal beliefs and their relationships to salient features" [5]. Depending on context and target population, different aspects are relevant for HRQOL. For children with cancer, Anthony et al. [6] have provided the most comprehensive conceptual framework so far. It covers four major domains: physical (symptoms, physical functioning), psychological (emotional distress, behavior, positive psychological function, self-esteem, body image, cognitive health), social (relationships, social functioning), and general health (health perception) [6].
In clinical routine and research, HRQOL is commonly assessed by patient-reported outcome measures (PROMs). In pediatrics, PROMs are often complemented with caregiver-reports. However, patient- and caregiver-reports often differ, especially for less observable outcomes that are only accessible from the patient's perspective (e.g., perceived burden, satisfaction with relationships) [7][8][9][10][11][12]. Several studies have indicated that children from 8 years onwards can reliably self-report [13][14][15]. Thus, it is recommended to treat patient-reports as the most important source of information in this age-group [7,16]. This is in line with a trend towards increasing the involvement and empowerment of children in research and treatment [17][18][19].
To assess HRQOL from children's perspective, evidence-based and age-appropriate PROMs are needed that meet psychometric quality criteria [20]. The most fundamental measurement property is content validity, defined as "the degree to which the content […] is an adequate reflection of the construct(s) to be measured" [20]. Claims regarding content validity can only be made when an instrument comprehensively assesses relevant aspects in a comprehensible way [21,22].
To ensure content validity, PROM development guidelines strongly recommend patient involvement in several stages [15,21,23-26]. They suggest involving patients in concept elicitation and issue generation to give their opinion on relevance and comprehensiveness. Later in the process, guidelines request cognitive interviews to evaluate whether item formulations, response-options, and recall-periods are understood as intended.
For children from the age of 8 years, recall-periods from 7 days to 4 weeks and faces-scales with ≤ 6 faces or Likert-scales with ≤ 5 points are usually considered suitable [24,27]. Adolescents and young adults (AYAs) around 14 years or older can complete the same tools as adults [28], but they face distinct HRQOL issues as they transition into adulthood [29,30].
Previous research has indicated that children with cancer were insufficiently involved in the development of existing PROMs [31]. It has been questioned whether they measure what is relevant for children [32], and whether they are complete [33] and of sufficient psychometric quality [31,34].
The present systematic review aims to evaluate the content validity of available PROMs for HRQOL in children with cancer aged between 8 and 14 years. To do so, the COSMIN methodology for assessing the content validity of PROMs [21, 22; COSMIN = COnsensus-based Standards for the selection of health Measurement INstruments] is applied. In a recently published review, this methodology was used to evaluate PROMs measuring positive psychological constructs [35]. Previous reviews using the COSMIN methodology to evaluate PROMs for pediatric oncology [34,36,37] were based on an older version [38][39][40], which was less comprehensive. The previous COSMIN guideline did not cover the key concept of comprehensibility, and its standards only checked whether certain steps were undertaken, without evaluating the methodological quality [22]. Thus, it is expected that ratings based on the old version will vary considerably from ratings based on the current version.

Methods
This systematic review follows the Preferred Reporting Items of Systematic Reviews and Meta-analyses (PRISMA) guidelines, where applicable [41]. The PRISMA checklist is provided in Additional file 1. At the time when we started to work on this review, it was not possible to register the protocol since common platforms (e.g., PROSPERO) accepted COVID-19-related protocols only. Thus, no protocol has been published.

Search strategy and study selection
A literature search was conducted on PubMed in December 2021 combining Medical Subject Headings (MeSH) related to HRQOL, the target population of children with cancer, and psychometrics: ("Quality of Life" [MeSH] AND (Neoplasms [MeSH] OR "Medical Oncology" [MeSH]) AND (Child [MeSH] OR Pediatrics [MeSH]) AND ("Self Assessment" [MeSH] OR "Patient Reported Outcome Measures" [MeSH] OR "Patient Outcome Assessment"[MeSH] OR "Self Report" [MeSH] OR "Psychometrics"[MeSH])). The search was neither limited to a specific time-period nor filtered for specific languages.
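The Boolean structure of this search string can also be written down programmatically, which may make it easier to rerun or adapt. The following Python sketch only assembles the query quoted above; the helper names `mesh` and `any_of` are illustrative assumptions, spacing and quoting are normalized, and no request is sent to PubMed.

```python
def mesh(term):
    """Format a MeSH heading; multi-word headings are quoted (hypothetical helper)."""
    label = f'"{term}"' if " " in term else term
    return f"{label}[MeSH]"


def any_of(terms):
    """Join several MeSH headings with OR inside parentheses (hypothetical helper)."""
    return "(" + " OR ".join(mesh(t) for t in terms) + ")"


# The four concept blocks named in the text: construct, disease, population, measurement.
construct = mesh("Quality of Life")
disease = any_of(["Neoplasms", "Medical Oncology"])
population = any_of(["Child", "Pediatrics"])
measurement = any_of([
    "Self Assessment",
    "Patient Reported Outcome Measures",
    "Patient Outcome Assessment",
    "Self Report",
    "Psychometrics",
])

# All four blocks must match, so they are combined with AND.
query = " AND ".join([construct, disease, population, measurement])
print(query)
```

Keeping each concept block in its own variable makes it straightforward to see which block a given heading belongs to when the search is updated.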
As a first step, abstracts were screened by one reviewer [MR] to identify PROMs for HRQOL assessment used in children with cancer within the age range between 8 and 14 years. This included generic and cancer-specific instruments but excluded survivor-specific instruments. PROMs primarily addressing adolescents (lower age-limit at ≥ 12) were excluded, but PROMs for transitional age-groups (children and adolescents) were included if the upper age-limit did not exceed 21 years. A PROM was considered relevant if the developers claimed to assess HRQOL or if it covered physical, psychological, and social health, as described in the conceptual framework by Anthony et al. [6]. PROMs assessing single symptoms or adverse effects were excluded (e.g., PedsQL Fatigue scale [42] or separate PROMIS-scales [43]).
To ensure that all relevant PROMs were included, the list of PROMs was compared to a list of 112 instruments identified by Algurén et al. for the development of the Overall Pediatric Health Standard Set (OPH-SS) [44] and a list of 155 PROMs collected in a simultaneously conducted review of HRQOL issues in children with cancer [45]. For all included instruments, manuals and review copies were searched. If not accessible, authors were contacted. Data regarding their main characteristics were extracted [MR], i.e., the target population (age, diagnoses), recall-period, response-options, the number of items, and the intended scale structure as well as whether a parent-version was available (see Table 1).
In a second step, full-texts and their reference-lists were screened by one reviewer [MR] to identify development and content validity studies for the investigated PROMs. The inclusion and exclusion criteria were based on the definitions provided by the COSMIN guidelines: Development studies include all studies on concept elicitation and studies testing PROMs under development, e.g., cognitive interview studies. Content validity studies include all studies that investigate the relevance, comprehensiveness, and/or comprehensibility of existing PROMs in a new sample. Additional searches on PubMed were conducted with PROM-names and "develop*" or "content valid*" to check whether further relevant studies were available. The included studies were evaluated according to the COSMIN guidelines (see below).

The COSMIN methodology for assessing content validity
The COSMIN methodology for assessing content validity is divided into three so-called 'boxes' with several 'standards' [22,46]. Box 1 evaluates the quality of PROM development, including general design (definition of construct, target population, and context/purpose; 35 standards), concept elicitation (7 standards), and cognitive interviews (22 standards).
Box 2 evaluates the quality of content validity studies, defined as studies on the relevance, comprehensiveness, and comprehensibility of existing PROMs performed in new samples [22]. The standards in box 2 assess whether and how patients were asked about relevance (standards 1-7), comprehensiveness (standards 8-14), and comprehensibility (standards 15-21), and whether and how professionals were asked about relevance (standards 22-26) and comprehensiveness (standards 27-31). As caregivers play an important intermediary role in pediatrics, we wanted to take their input into account as well. After consulting with the COSMIN Group, we decided to use the standards for expert involvement (standards 22-31) to rate whether and how caregivers were asked about relevance and comprehensiveness. In box 3, the results of development and content validity studies are rated against ten criteria for good content validity. Additionally, reviewers were asked to give their own ratings of comprehensiveness, relevance, and comprehensibility of the tool (eight standards). In terms of comprehensibility, ratings for response-options and recall-periods were based on recommendations from a recent review by Coombes et al. [27]. Item-formulations were rated positive, except if items appeared obviously inappropriate for children. For consistent relevance and comprehensiveness ratings, the items of all PROMs were systematically categorized by content, as described below.
In a final step, the overall ratings are summarized and the quality of evidence is graded. Following the COSMIN guidelines, evidence is rated 'low' or 'very low' if there has been no content validity study of at least 'doubtful' quality. If content validity has not been sufficiently assessed, the development process needs to be of 'adequate' or 'very good' quality to obtain a 'moderate' evidence level. For evidence to obtain a 'high' rating, there needs to have been at least one content validity study of 'adequate' or 'very good' quality.
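The grading rules in the preceding paragraph can be illustrated with a small decision function. This is a simplified sketch of the paragraph as worded here, not the full COSMIN grading procedure: the function name `grade_evidence` is hypothetical, the stated conditions are treated as sufficient, the split between 'low' and 'very low' is left open because the paragraph does not specify it, and one combination that the paragraph does not cover is flagged as such.

```python
# Quality labels that count as at least 'adequate'.
GOOD = {"adequate", "very good"}


def grade_evidence(development_quality, cv_study_qualities):
    """Map study quality ratings to an evidence level (simplified assumption).

    development_quality: quality rating of the PROM development study.
    cv_study_qualities: list of quality ratings of content validity studies
                        (an empty list means that no such study exists).
    """
    # 'high' needs at least one content validity study of good quality.
    if any(q in GOOD for q in cv_study_qualities):
        return "high"
    # Without such a study, a good development process yields 'moderate'.
    if development_quality in GOOD:
        return "moderate"
    # No content validity study of at least 'doubtful' quality.
    if "doubtful" not in cv_study_qualities:
        return "low or very low"
    # A 'doubtful' content validity study combined with a weaker development
    # study is not covered by the paragraph above.
    return "not specified"


# 'Adequate' development and no content validity study.
print(grade_evidence("adequate", []))
# 'Doubtful' development and one 'inadequate' content validity study.
print(grade_evidence("doubtful", ["inadequate"]))
```

The first call mirrors the situation reported for the PROMIS Pediatric Profile in this review ('adequate' development, no content validity study); the second uses made-up input values.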
The ratings of boxes 1 and 2 were conducted by two reviewers independently [MR, AM], using the Excel sheet available from the COSMIN website (cosmin.nl). We made minor adaptations to this sheet by adding columns for the reviewers to justify their decisions. Conflicts were discussed until consensus was reached. The ratings of box 3 and the final evidence grading were performed by one reviewer [MR] and approved by all co-authors.

Categorizing items by the contents assessed
To provide a uniform and solid basis for reviewers' ratings of comprehensiveness and relevance, items from all investigated PROMs were extracted into an Excel-file and mapped onto the conceptual framework by Anthony et al. [6]. Within this hierarchical framework, the domains of physical, psychological, and social health were further divided into subdomains, containing several identifying concepts. For example, physical health is divided into symptoms (e.g., pain, fatigue) and physical function (e.g., dexterity, mobility), while social health is divided into relationships (e.g., with family or peers) and social function (e.g., recreation and leisure, school). The psychological domain has the most subdomains and is divided into emotional distress (e.g., afraid, sad), behavior (e.g., clingy, defiant), positive psychological function (e.g., benefit finding), self-esteem (e.g., feeling loved or proud), body image (e.g., personal appearance), and cognitive issues (e.g., attention, remembering). Each item was assigned to one domain, subdomain, and identifying concept by one reviewer [MR]. Open-ended questions, conditional items (filter-questions), and determinant questions (on background information of the patient) were not taken into account. To enable a consistent categorization across all items, we defined categorization rules (Additional file 2). A second reviewer [DR] indicated his (dis)agreement per item. Conflicts were discussed until consensus was reached. Where necessary, new subdomains and identifying concepts were added to complement the conceptual framework (Additional file 3).
Descriptive statistics were applied to investigate the representation of contents within the overall item pool and the questionnaires. Item content was considered relevant if it could be assigned to one of the subdomains. Questionnaires were considered comprehensive when they covered physical health and social health (at least family/general) and several aspects of psychological health, i.e., negative emotional health issues (emotional distress or treatment burden), positive issues (positive psychological functioning or self-esteem), and cognitive issues.
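The comprehensiveness rule above can be expressed as a simple predicate over the contents a questionnaire covers. The following Python sketch is one possible reading of that rule: the function name `is_comprehensive` and the content labels are illustrative assumptions, all three psychological aspects are required, and the "at least family/general" condition for social health is reduced to a single label.

```python
def is_comprehensive(covered_contents):
    """Return True if the covered contents satisfy the rule sketched above.

    covered_contents: iterable of content labels (hypothetical naming).
    """
    covered = set(covered_contents)
    physical = "physical health" in covered
    # Simplification: one label stands for 'social health (at least family/general)'.
    social = "social health" in covered
    # Negative emotional health issues: emotional distress OR treatment burden.
    negative = bool(covered & {"emotional distress", "treatment burden"})
    # Positive issues: positive psychological functioning OR self-esteem.
    positive = bool(covered & {"positive psychological function", "self-esteem"})
    cognitive = "cognitive issues" in covered
    return physical and social and negative and positive and cognitive


# Hypothetical questionnaire that lacks cognitive issues.
example = {"physical health", "social health", "emotional distress", "self-esteem"}
print(is_comprehensive(example))                         # False
print(is_comprehensive(example | {"cognitive issues"}))  # True
```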

Identification of PROMs and their main characteristics
As shown in Fig. 1, the literature search identified 231 articles and screening for PROMs resulted in a list of nine inventories (i.e., measurement systems or questionnaire providers). Two of them provided different modules (e.g., generic and cancer-specific), resulting in 12 different PROMs. Taking versions of different length into account, 17 questionnaires were identified. Counterchecking against the PROMs collected for the development of the OPH-SS [44] and our review of HRQOL issues [45] did not yield any additional instruments. For the included PROMs, 53 development and content validity studies

Contents assessed by included PROMs
For all but one PROM (SQOLPOP), review copies or item lists were found. Four hundred different items were retrieved, some of which belong to more than one length-version or module. Of these 400 items, 22 were excluded as open-ended questions, determinant, or conditional items. No conflicts occurred in defining the question type.
The remaining 378 items were assigned to one of the domains, subdomains, and identifying concepts within the conceptual framework by Anthony et al. [6]. The reviewers agreed upon the categorization of 94.97% of items (359/378). The few conflicts were easily resolved, and the complementation of the HRQOL model for content categorization was discussed [MR, DR] (Additional file 3). The categorizations were adapted accordingly [MR], and the final categorization was approved again [DR].
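The item counts and the agreement rate reported above can be checked with a few lines of arithmetic. The variable names in this Python sketch are illustrative; the numbers are taken from the text.

```python
retrieved_items = 400   # items retrieved from review copies and item lists
excluded_items = 22     # open-ended, determinant, or conditional items

categorized_items = retrieved_items - excluded_items   # 378 items to categorize
agreed_items = 359                                      # items both reviewers agreed on

# Percent agreement between the two reviewers, rounded to two decimals.
agreement_pct = round(100 * agreed_items / categorized_items, 2)
conflicts = categorized_items - agreed_items

print(categorized_items, agreement_pct, conflicts)  # 378 94.97 19
```

Note that this is simple percent agreement; the source does not report a chance-corrected statistic.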
Most items from the overall item pool cover psychological aspects, as displayed in Fig. 2. Upon closer inspection of the different PROMs (Fig. 2), it is apparent that the generic instruments and core scales (except for the PedsQL Generic Core Scale) assess fewer physical and more social issues than instruments designed for children with chronic diseases or cancer. In contrast, the PROMIS Pediatric Profile and the PedsQL Brain Tumor Module have the strongest focus on physical health, with approximately 50% of their items being dedicated to this domain. Cognitive issues are mostly represented in the PedsQL Brain Tumor and Cancer Modules, but not covered in the PROMIS Pediatric Profile. Additional file 4 provides more detail.

Quality ratings of development studies
The ratings obtained for the quality of development studies are displayed in Table 2, including justifications for ratings other than 'very good' (V). For most instruments, a clear definition of the construct to be measured, the target population, and the context was given. For the KINDL-R Oncology Module, these points remained 'doubtful', as no development study was available. The SQOLPOP obtained an 'inadequate' rating, because the development study did not clarify which dimensions this questionnaire should capture [67].
The involvement of the target population in concept elicitation was rated 'inadequate' (five PROMs) or 'doubtful' (five PROMs) for most PROMs. In some cases, no children were involved in the development studies (PAC-QOL, SQOLPOP, TACQOL). For other PROMs, methods were described insufficiently. For example, for the PedsQL modules, it remains unclear how they were derived from the previous PCQL.
For four instruments, no cognitive interviews were conducted (KINDL-R Oncology, PedsQL Generic, PedsQL Cancer, TACQOL); in another three cases, it remained 'doubtful' whether they were conducted in the target population (PedsQL Brain Tumor, QOLCC-7-12, SQOLPOP). The remaining studies solely investigated comprehensibility, whereas comprehensiveness was often not investigated (DISABKIDS, KIDSCREEN, KINDL-R Generic, PAC-QOL). All but one had to be rated as 'doubtful' or even 'inadequate' for comprehensiveness, mostly because it remained unclear whether the identified difficulties were addressed and because items were not appropriately (re-)tested in their final form. The PROMIS Pediatric Profile was the only instrument for which 'very good' methods were applied and reporting was good. Nevertheless, it received only an 'adequate' rating, because most items were tested in five or six patients, while a 'very good' rating would have required seven or more patients per item.
The total rating for the development was based on the quality of concept elicitation and the quality of cognitive interview studies. The overall development was of 'inadequate' quality for eight PROMs and of 'doubtful' quality for another three PROMs. Only the PROMIS Pediatric Profile was informed by an 'adequate' (almost 'very good') development procedure.

Quality ratings of content validity studies
Quality ratings for content validity studies are provided in Table 3, including justifications for ratings other than 'very good' (V). Content validity studies were conducted for only three PROMs: the DISABKIDS, the KINDL-R Generic Module, and the QOLCC-7-12. For all three, quality was rated 'inadequate'. The QOLCC-7-12 was only evaluated with five healthcare-experts, but no patients or caregivers were involved [65,100]. For the DISABKIDS, only a few written comments by children and parents were taken into account, while focus groups were held with nurses [55]. Furthermore, it is questionable whether the comments resulted in any adaptations. In the study investigating the KINDL-R Generic Module, children were asked to rate the relevance and comprehensibility of the whole questionnaire, but not for each item individually [76].

Rating of results and evidence grading
Following the COSMIN methodology, the development and content validity studies of mostly 'doubtful' or 'inadequate' quality can only provide 'very low' or 'low' evidence for the relevance, comprehensiveness, and comprehensibility of nearly all investigated PROMs. Only the PROMIS Pediatric Profile, with its 'adequate' (almost 'very good') development procedure, can rely on a 'moderate' evidence base for the three components of content validity. The quality of evidence for each PROM is displayed in Table 4, together with ratings of the results.
Due to the 'very low' evidence for most PROMs, the ratings often rely on reviewers' ratings. As no review copy was available for the SQOLPOP, only 'indeterminate' ratings could be given for this instrument. For all other measures, ratings of results for relevance and comprehensiveness were based strictly on the content categorization described before. Relevance was rated as 'sufficient' because all items could be mapped onto the conceptual model of HRQOL. However, the comprehensiveness of seven PROMs was rated as 'insufficient', mostly because cognitive issues or positive psychological functioning were missing.
As all instruments have age-appropriate recall-periods and response-options, reviewers' comprehensibility ratings were positive and/or followed the study results. Only for the KINDL-R Oncology Module did reviewers rate the comprehensibility as 'insufficient', because its design is considerably complex. In this PROM, some items require three responses: For symptoms, children must indicate frequency and the resulting burden. For treatment- or procedure-related issues, a conditional item is followed by frequency and burden ratings.
Table footnotes (fragmentary):
[…] were not tested in their final form and analysis and results were only described vaguely [48,73].
Even though very good methods were applied, the rating of CIs can only be 'adequate', because 93% of the items have been tested in 5 or 6 patients only [85], which is insufficient to obtain a 'very good' rating in the COSMIN manual (7 required). Additional items developed by Quinn et al. [89] have been tested in at least 5 children per item. Further item reduction relied on quantitative methods only (methods described in [103]; results per domain: [87,90-93]).

Discussion
The quality assessment of development, cognitive interview, and content validity studies showed that none of the investigated PROMs has a solid evidence base for its content validity. For most instruments, evidence is 'very low'; only the PROMIS Pediatric Profile is based on 'moderate' evidence. Overall, the scarce evidence available indicates that the PROMs cover relevant issues, while evidence for comprehensiveness and comprehensibility is partly inconsistent or indicates that these have not been sufficiently fulfilled.

Methodological shortcomings and possible explanations
The reasons for this low evidence level can be found in the study design, methodological quality, and insufficient reporting. As already stated by Klassen et al. [31], patients were not sufficiently involved. Guidelines on patient involvement in PROM development as well as reporting guidelines appeared only after most instruments had been developed. Thus, the developers of the investigated PROMs could not yet benefit from their guidance. The concept of content validity in particular was not clearly defined for a long time.

Missing qualitative studies and patient involvement
Most of the PROMs were developed in the 1990s or early 2000s, before the publication of milestone policies by the European Medicines Agency (EMA) [108] and the US Food and Drug Administration (FDA) [109] and methodological guidelines on PROM development or content validity around 2010, e.g., by the International Society for Pharmacoeconomics and Outcomes Research Patient Reported Outcome Good Research Practices Task Force (ISPOR PRO) [24-26,110] or the PROMIS developers [85,86]. This might explain poor or inconsistent methods and reporting. However, missing or 'inadequate' development studies could be compensated by qualitative content validity studies to strengthen the evidence for existing tools. As an example, the content validity of the most widely used adult cancer questionnaire, the EORTC QLQ-C30, is currently being evaluated with adult [111] and adolescent cancer patients [112]. For the pediatric PROMs included in the present review, almost no content validity studies were available.
Lacking qualitative evidence, investigators take the mere use of questionnaires as an indicator of content validity. For example, Arabiat et al. state that "Face and content validity were assumed because the PedsQL™ (4.0) is widely used and reported in quality of life research" [83]. Despite strong recommendations for patient involvement, there are several barriers to qualitative research. Applying qualitative methods is partly a question of resources (i.e., financial means, infrastructure, collaborations, expertise, etc.). For example, Petersen et al., who interviewed children during the development procedure of the DISABKIDS, concluded that "these techniques are a helpful method. Nevertheless, the amount of time necessary to carry this out and analyze it is a weakness of this approach" [69]. Despite these challenges, qualitative methods are crucial, because content validity is a question of heuristics that cannot be resolved by quantitative methods.

Missing clarity about the concept of content validity
Another reason for missing research on content validity might be that this measurement property has been the subject of scientific dispute [113]. Following critique from modern test theory, guidelines seemingly struggled to redefine the concept and to identify methods for its assessment [113,114]. It is only in the latest version of the COSMIN methodology that content validity is clearly described by the three components of relevance, comprehensiveness, and comprehensibility, and that corresponding standards and criteria are defined [21,22]. This new and clear definition and the high requirements of the recent COSMIN guidelines make a considerable difference. Wayant et al. [35], who used the new methodology, found the same lack of evidence highlighted by our review. This is in contrast with reviews based on the older version, which came to very positive results [e.g., 34].
As the operationalization of content validity by relevance, comprehensibility, and comprehensiveness is still young, studies so far have seldom covered all three components separately and equally. For example, Kudubes and Bektas [67] asked health-care professionals only to rate how much change was needed for each item, without specifying what kind of change was required and why. If studies made a distinction between the three components, comprehensiveness was less often investigated compared to relevance and comprehensibility. This is in line with a recent review of studies on measurement properties of PROMs, which found that 77.8% of the studies assessed relevance, 48.2% evaluated comprehensibility, and only 3.7% focused on comprehensiveness [115].
When it comes to comprehensibility, there is again a lack of differentiation. Wayant et al. [35] state that instructions were not investigated for any of the PROMs included in their review; rather, the studies focused solely on items. In our review, the PROMIS Pediatric Profile is the only tool for which items, instructions, response-options, and recall-periods were assessed separately [85]. For the KINDL-R Generic Module, which was developed a decade earlier, comprehensibility was not even rated per item, but for the whole questionnaire [76].

'Doubtful' ratings of study quality due to poor reporting
Not only is there a lack of qualitative studies of high quality for assessing content validity, but most 'doubtful' ratings were given due to insufficient reporting. In several cases, development and cognitive interview studies were only briefly described in a paragraph of a later study focusing on quantitative validity or reliability testing. Such shortcomings in reporting of qualitative methods in PROM development are a well-known problem and not specific to the field of pediatric oncology [116]. The recently published COSMIN reporting guideline will hopefully improve the situation [117]. However, it gives only very loose rules for content validity studies, defining what must be reported. It does not provide guidance on how much detail is required to meet the criteria of the COSMIN methodology for assessing content validity. Therefore, it might be useful to also have this methodology in mind when developing a new instrument. Even though Gagnier et al. differentiate clearly between the scopes of the two guidelines [117], it would surely help to prepare, conduct, and report future research more effectively and to provide more solid evidence.

Limitations and challenges of applying the COSMIN methodology on content validity assessment
We are aware that the search strategy underlying this review was limited. The search was conducted in only one database, PubMed, and did not rely on the extensive search filter by COSMIN [118]. This filter, however, is designed to find studies reporting all psychometric properties and not specifically content validity. Thus, the results would have exceeded the scope of our review. That no further PROMs could be identified through cross-checking with very comprehensive reviews [44,45] indicates that our search was sufficiently fit for identifying relevant PROMs. Corresponding development and content validity studies are usually referred to as primary citations. Beyond that, we conducted additional searches and contacted PROM designers and authors to make sure that no relevant studies were missed.
While the COSMIN methodology is the current gold standard for assessing the quality criteria of PROMs, its application was partly challenging. Not only is the reporting inconsistent and insufficient, but the differentiation between cognitive interview and content validity studies is sometimes difficult to make. Furthermore, the COSMIN guidelines propose rating each subscale separately [22]. This was rarely possible, because most of the multidimensional PROMs were developed as a whole and the information was not given per subscale. Even for the PROMIS Pediatric Profile, for which subscales were developed separately, not all steps and results were reported for each subscale in detail. These uncertainties led to many 'doubtful' ratings. Since the COSMIN methodology follows the worst-score-counts-principle, one 'doubtful' rating results in a 'doubtful' overall rating. This principle could be criticized for being too strict, as less relevant deficiencies could outweigh more important standards that were well met.
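The worst-score-counts principle mentioned above amounts to taking the lowest rating across all standards. A minimal Python sketch, assuming the four-level quality scale used in this review and a hypothetical function name `overall_rating`:

```python
# Quality levels ordered from worst to best.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def overall_rating(standard_ratings):
    """Return the lowest rating among the rated standards (worst score counts)."""
    return min(standard_ratings, key=RATING_ORDER.index)


# A single 'doubtful' standard determines the overall result,
# regardless of how many standards were rated 'very good'.
print(overall_rating(["very good", "very good", "doubtful", "very good"]))  # doubtful
```

This makes the criticism in the text concrete: the number of well-met standards has no influence on the overall result.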
The situation is further complicated because the guidelines were not developed for pediatric tools and do not provide any advice on how to consider evidence provided by caregivers. We tried to resolve this by adding the standards required for expert involvement in content validity studies to take caregiver interviews into account. One could argue that caregivers' input should also have been considered in concept elicitation or cognitive interview studies. However, as caregiver- and patient-reports often differ considerably, we decided to not systematically consider input from caregivers during these steps, in exactly the same way that the opinions of health-care professionals are ignored at this point following the COSMIN guidelines.

Conclusion and implications
Following the COSMIN methodology, this systematic review showed that there is only fragile evidence for the content validity of PROMs for HRQOL in children with cancer. Only the PROMIS Pediatric Profile has a 'moderate' level of evidence. Results indicate that it covers relevant issues and is comprehensible. Its comprehensiveness could be improved by adding further pediatric PROMIS scales (e.g., cognitive function, meaning and purpose, life satisfaction, positive affect) [43]. Thus, among the investigated PROMs, the PROMIS Pediatric Profile is recommended. However, this instrument is not disease-specific, and it might be worthwhile conducting a qualitative content validity study in children with cancer.
This lack of evidence can be explained by several factors: Most investigated instruments were developed before the publication of milestone policies and guidelines. Learning from the strengths and limitations of said previous PROM developments, these guidelines set new methodological standards. Content validity, in particular, was only clearly defined in the latest version of the COSMIN methodology. While it is, therefore, understandable that previous projects did not fulfill all required standards, PRO and HRQOL research in pediatric oncology should still try to catch up with the scientific and methodological progress of the last decade. Therefore, we argue that further efforts are needed to provide PROMs for HRQOL assessment in children with cancer that are based on solid evidence. This could include the development of new instruments, as well as performing content validity studies to strengthen the evidence for already-existing PROMs. In each case, it is strongly recommended that existing guidelines on qualitative methods and reporting standards for these study types be adhered to. Within the EORTC QLG, we are currently developing an HRQOL questionnaire for children with cancer [119]. Following the EORTC QLG module development guidelines [23], this involves not only a literature review [45], but also in-depth interviews with children with cancer, their parents, and health-care professionals.
Table footnotes (fragmentary):
A CV study in children with chronic diseases (asthma, diabetes, not cancer) was 'doubtful', as they rated the comprehensibility and relevance of the questionnaire as a whole instead of single items [76]. Further validation based on quantitative methods only [49,50].
As it relied solely on quantitative methods to analyze reliability and construct validity, the paper by Ergin et al. [60] cannot be taken into account as a content validity study.
[…] affect). However, they did not report overall scores for relevance ratings, but only differences between child- and parent-ratings.
This section was added to take the central role of parents in pediatric healthcare into account. As it is not required by the COSMIN Guidelines, 'inadequate' ratings were only given when parents were involved using 'inadequate' methods. If content validity studies were performed without parents, parent involvement was not rated.
1 Following the COSMIN guidelines, evidence was 'low', because the development procedure is 'doubtful' and no content validity study is available. However, no study results are available, so that rating of results is solely based on reviewers' ratings. Thus, the evidence was down-graded to 'very low'. This is in line with a current review by Bull et al. [37], who state that no information on the content validity is provided for the PedsQL Brain Tumor Module.